Destination bigquery: rerelease 1s1t behind gate #27936

Merged: 30 commits, Jul 14, 2023

Conversation

@edgao (Contributor) commented on Jul 3, 2023

readers: please see individual commits.

plan:

  1. merge bugfixes into this branch
  2. release to test workspace -> slow rollout
  3. squash bugs as we find them

@github-actions bot commented on Jul 3, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan and you've followed all steps in the Breaking Changes Checklist
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • The connector tests are passing in CI
  • You've updated the connector's metadata.yaml file (new!)
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@evantahler (Contributor) commented on Jul 5, 2023

@edgao I'm going to close this PR and do one just for the SQL change. We can talk about using more variables later.

Edit... I closed the wrong PR...

@evantahler evantahler closed this Jul 5, 2023
@evantahler evantahler reopened this Jul 5, 2023
}

public ParsedCatalog parseCatalog(ConfiguredAirbyteCatalog catalog) {
// this code is bad and I feel bad
A reviewer (Contributor) commented:

❤️

A reviewer (Contributor) commented:

I will create a ticket for fixing this, seems like too much of a mess to improve/untangle right now?

@edgao (author) replied:

yeah, this only affects 1s1t output anyway.

we also have #27798, which probably touches this part of the codebase - anything we do here should take that into account

@@ -52,7 +51,7 @@ public static BufferCreateFunction createBufferFunction(final S3AvroFormatConfig
return (pair, catalog) -> {
final AirbyteStream stream = catalog.getStreams()
.stream()
.filter(s -> s.getStream().getName().equals(pair.getName()) && StringUtils.equals(s.getStream().getNamespace(), pair.getNamespace()))
A reviewer (Contributor) commented:

Is it true that even in the async world, we shouldn't have more than one namespace? Like even for CDC syncs, it's multiple streams/tables which are all being written to the same namespace? Does this logic have to consider any of the name mangling logic in the catalog parser?

@edgao (author) replied:

git says you might be the most qualified to answer that 😅 db28b13#diff-d68a5be057956e8fb021561f4c7111aa1e72eac6b87906eee8d7058a99f87163

but we explicitly support connections with two streams that have the same name but different namespace (iirc there's a DAT that verifies this...)

name mangling

depends on how exactly we're calling into this class, but StreamId has originalNamespace / originalName for this exact reason - as long as we're only using the mangled names inside the sql generator, it should be fine. (I haven't verified that we're doing it correctly though)
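The null-safe namespace matching in the diff above (`StringUtils.equals`) can be sketched with the JDK's `Objects.equals`, which also treats two nulls as equal. This is an illustrative stand-in, not the actual Airbyte types; the class and record names are hypothetical:

```java
import java.util.List;
import java.util.Objects;
import java.util.Optional;

// Illustrative sketch: resolve a stream by BOTH name and namespace, so two streams
// with the same name in different namespaces (explicitly supported, per the thread)
// do not collide. Namespaces may be null, hence the null-safe Objects.equals.
public class StreamLookup {

  // Minimal hypothetical stand-in for an Airbyte stream descriptor.
  public record StreamDescriptor(String namespace, String name) {}

  public static Optional<StreamDescriptor> find(final List<StreamDescriptor> streams,
                                                final String namespace,
                                                final String name) {
    return streams.stream()
        .filter(s -> s.name().equals(name) && Objects.equals(s.namespace(), namespace))
        .findFirst();
  }

  public static void main(String[] args) {
    final List<StreamDescriptor> catalog = List.of(
        new StreamDescriptor("public", "users"),
        new StreamDescriptor("audit", "users"), // same name, different namespace
        new StreamDescriptor(null, "events"));  // namespace may be null

    System.out.println(find(catalog, "audit", "users").orElseThrow());
    System.out.println(find(catalog, null, "events").orElseThrow());
  }
}
```

Matching on name alone would return whichever of the two `users` streams happened to come first; matching on the pair keeps them distinct.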

final TimePartitioning partitioning = TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
.setField(JavaBaseConstants.COLUMN_NAME_EMITTED_AT)
.setField(chunkingColumn)
.build();

final Clustering clustering = Clustering.newBuilder()
A reviewer (Contributor) commented:

@evantahler this is where the clustering logic is; it's an array, so I wonder if clustering on all columns is crazy? Or maybe just leaving off the data column?

@evantahler commented on Jul 5, 2023:

@alex-gron maybe you have some advice? I don't think we should be clustering on more than one column in BigQuery perhaps? Either way, I think this is equivalent to what we are doing now - clustering by one column, COLUMN_NAME_EMITTED_AT, so I'm 👍 with it.

[Screenshot: 2023-07-05 at 3:10:22 PM]
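One constraint worth noting for the "cluster on all columns" question: BigQuery allows at most four clustering columns per table, so clustering on everything is not an option. A self-contained sketch (illustrative only, not the connector's actual logic; column names are examples) of picking clustering fields while leaving off the wide raw data column:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: choose clustering columns for a BigQuery table. BigQuery caps
// clustering at 4 columns, so we take the emitted-at column first, then primary-key
// columns, and skip the wide raw JSON data column.
public class ClusteringColumns {

  static final int BIGQUERY_MAX_CLUSTERING_COLUMNS = 4;

  public static List<String> choose(final String emittedAtColumn,
                                    final List<String> primaryKeyColumns,
                                    final String dataColumn) {
    final List<String> fields = new ArrayList<>();
    fields.add(emittedAtColumn);
    for (final String pk : primaryKeyColumns) {
      if (!pk.equals(dataColumn) && !fields.contains(pk)) {
        fields.add(pk); // dedupe, and exclude the raw data column
      }
    }
    return fields.subList(0, Math.min(fields.size(), BIGQUERY_MAX_CLUSTERING_COLUMNS));
  }

  public static void main(String[] args) {
    // Five candidates collapse to the 4-column BigQuery limit.
    System.out.println(choose("_airbyte_extracted_at",
        List.of("id", "updated_at", "region", "tenant"), "_airbyte_data"));
  }
}
```

The resulting list is what would be fed into `Clustering.newBuilder()` in the real SDK code above.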

canonicalized = "_" + canonicalized;
}

// TODO this is probably wrong
A reviewer (Contributor) commented:

... 😅

@edgao (author) replied:

yeah... I cut a lot of corners back when I thought we didn't need to support name mangling in 1s1t >.>

we have a ticket to make sure 1s1t does this stuff correctly, right?
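For readers following along, the underscore-prefix line above is part of identifier mangling: destinations reject names starting with certain characters. A self-contained sketch of that style of canonicalization (the rules here are illustrative, not Airbyte's actual ones) also shows why this kind of code is tricky: distinct source names can collide after mangling, which is one reason to keep the original names around, as `StreamId` does:

```java
// Illustrative sketch of identifier mangling (not Airbyte's actual rules): replace
// characters the destination rejects with underscores, and prefix an underscore when
// the result would start with a digit.
public class NameCanonicalizer {

  public static String canonicalize(final String name) {
    String canonicalized = name.replaceAll("[^A-Za-z0-9_]", "_");
    if (!canonicalized.isEmpty() && Character.isDigit(canonicalized.charAt(0))) {
      canonicalized = "_" + canonicalized;
    }
    return canonicalized;
  }

  public static void main(String[] args) {
    System.out.println(canonicalize("3d-model"));
    // Hazard: "a-b" and "a_b" both canonicalize to "a_b", so a mangling scheme needs
    // either collision handling or a retained original name.
    System.out.println(canonicalize("a-b").equals(canonicalize("a_b")));
  }
}
```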

}

@Override
public String createTable(final StreamConfig stream, final String suffix) {
A reviewer (Contributor) commented:

I think this and BigQueryUtils is where the current logic lives? But it's using the BigQuery Java SDK rather than SQL.

The reviewer added:

sorry, realizing this comment isn't clear - we had talked about colocating some of the common operations. Not necessarily right now, but maybe it's a good idea to colocate any "create table/namespace" logic

@edgao (author) replied:

mmhm. I don't have a strong opinion on how to make this happen - a few weeks back I was pretty dogmatic about (1) use sql for everything, (2) keep separate classes for "generate SQL" and "actually interact with the db", but I feel less strongly after actually seeing the code written

e.g. maybe we just merge sqlgenerator+destinationhandler? or kill off destinationhandler and add everything into some other existing class
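The trade-off being weighed here (separate "generate SQL" and "execute against the warehouse" classes versus merging them) can be sketched with two small seams. All names below are hypothetical, not the connector's actual interfaces:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the two-seam design: a pure SQL generator (no I/O, trivially
// unit-testable) and a handler that actually talks to the database. Merging them, as
// the thread considers, trades that isolated testability for colocated logic.
public class SqlSeams {

  interface SqlGenerator {
    String createTable(String namespace, String table);
  }

  interface DestinationHandler {
    void execute(String sql);
  }

  static final SqlGenerator GENERATOR =
      (namespace, table) -> "CREATE TABLE `" + namespace + "`.`" + table + "` (...)";

  public static void main(String[] args) {
    final List<String> executed = new ArrayList<>();
    final DestinationHandler handler = executed::add; // fake handler that records SQL
    handler.execute(GENERATOR.createTable("dataset", "users_raw"));
    System.out.println(executed);
  }
}
```

Because the generator is pure, tests can assert on the exact SQL string without a live warehouse; the handler can be faked, as above, or backed by a real client.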

edgao and others added 4 commits July 5, 2023 15:06
🐛 Destination Bigquery: fix bug in standard inserts for syncs >10K records (#27856)

* only run t+d code if it's enabled

* dockerfile+changelog

* remove changelog entry
Destinations V2: handle optional fields for `object` and `array` types (#27898)

* catch null schema

* fix null properties

* clean up

* consolidate + add more tests

* try catch

* empty json test
* switch to checkedconsumer

* add unit test for buildColumnId

* use flag

* restructure prefix check

* fix build
edgao and others added 7 commits July 10, 2023 14:36
* more type-parsing fixes

* handle duplicates

* Automated Commit - Format and Process Resources Changes

* add tests for asColumns

* Automated Commit - Format and Process Resources Changes

* log warnings instead of throwing exception

* better log message

* error level

---------

Co-authored-by: edgao <edgao@users.noreply.github.com>
Change from T&D every 10k records to an increasing time based interval (#28130)

* fifteen minute t&d

* add typing and deduping operation valve for increased intervals of typing and deduping

* Automated Commit - Format and Process Resources Changes

* resolve bizarre merge conflict

* Automated Commit - Format and Process Resources Changes

---------

Co-authored-by: jbfbell <jbfbell@users.noreply.github.com>
* Simplify and speed up CDC delete support [DestinationsV2]

* better QUOTE

* spotbugs?

* recompile dbt image for local arch and use that when building images

* things compile, but tests fail

* tests working-ish

* comment

* fix logic to re-insert deleted records for cursor comparison.

tests pass!

* remove comment

* Skip CDC re-include logic if there are no CDC columns

* stop hardcoding pk (#28092)

* wip

* remove TODOs

---------

Co-authored-by: Edward Gao <edward.gao@airbyte.io>
edgao and others added 3 commits July 12, 2023 16:25
* initial implementation

* Automated Commit - Formatting Changes

* add second sync to test

* do concurrent things

* Automated Commit - Formatting Changes

* clarify comment

* minor tweaks

* more stuff

* Automated Commit - Formatting Changes

* minor cleanup

* lots of fixes

* handle sql vs json null better
* verify extra columns
* only check deleted_at if in DEDUP mode and the column exists
* add full refresh append test case

* Automated Commit - Formatting Changes

* add tests for the remaining sync modes

* Automated Commit - Formatting Changes

* readability stuff

* Automated Commit - Formatting Changes

* add test for gcs mode

* remove static fields

* Automated Commit - Formatting Changes

* add more test cases, tweak test scaffold

* cleanup

* Automated Commit - Formatting Changes

* extract recorddiffer

* and use it in the sql generator test

* fix

* comment

* naming+comment

* one more comment

* better assert

* remove unnecessary thing

* one last thing

* Automated Commit - Formatting Changes

* enable concurrent execution on all java integration tests

* add test for default namespace

* Automated Commit - Formatting Changes

* implement a 2-stream test

* Automated Commit - Formatting Changes

* extract methods

* invert jsonNodesNotEquivalent

* Automated Commit - Formatting Changes

* fix conditional

* pull out diffSingleRecord

* Automated Commit - Formatting Changes

* handle nulls correctly

* remove raw-specific handling; break up methods

* Automated Commit - Formatting Changes

---------

Co-authored-by: edgao <edgao@users.noreply.github.com>
Co-authored-by: octavia-approvington <octavia-approvington@users.noreply.github.com>
@edgao edgao marked this pull request as ready for review July 13, 2023 16:56
@edgao edgao requested review from a team as code owners July 13, 2023 16:56
@edgao edgao enabled auto-merge (squash) July 13, 2023 16:57
@octavia-squidington-iii (Collaborator) commented:

destination-postgres-strict-encrypt test report (commit bf65992ea8) - ❌

⏲️ Total pipeline duration: 13mn57s

Step Result
Validate airbyte-integrations/connectors/destination-postgres-strict-encrypt/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-postgres-strict-encrypt docker image for platform linux/x86_64
Build airbyte/normalization:dev
./gradlew :airbyte-integrations:connectors:destination-postgres-strict-encrypt:integrationTest

🔗 View the logs here

Please note that tests are only run on PRs that are ready for review. Set your PR to draft mode to avoid flooding the CI engine and upstream services on subsequent commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool:

airbyte-ci connectors --name=destination-postgres-strict-encrypt test

@octavia-squidington-iii (Collaborator) commented:

destination-snowflake test report (commit bf65992ea8) - ❌

⏲️ Total pipeline duration: 40mn15s

Step Result
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-snowflake docker image for platform linux/x86_64
Build airbyte/normalization-snowflake:dev
./gradlew :airbyte-integrations:connectors:destination-snowflake:integrationTest

🔗 View the logs here

Run the same pipeline locally with:

airbyte-ci connectors --name=destination-snowflake test

* move create raw tables

* better log message
@octavia-squidington-iii (Collaborator) commented:

destination-mariadb-columnstore test report (commit bf65992ea8) - ❌

⏲️ Total pipeline duration: 91mn21s

Step Result
Validate airbyte-integrations/connectors/destination-mariadb-columnstore/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-mariadb-columnstore docker image for platform linux/x86_64
./gradlew :airbyte-integrations:connectors:destination-mariadb-columnstore:integrationTest

🔗 View the logs here

Run the same pipeline locally with:

airbyte-ci connectors --name=destination-mariadb-columnstore test

@edgao (author) commented on Jul 13, 2023

cancelled the PR check run; running connector tests in https://github.com/airbytehq/airbyte/actions/runs/5548043438/jobs/10130507088

(--name destination-bigquery --name destination-bigquery-denormalized --name destination-gcs --name destination-postgres --name destination-postgres-strict-encrypt --name destination-redshift --name destination-s3 --name destination-snowflake )

only testing a subset of connectors so that they don't take literally forever.

@edgao (author) commented on Jul 13, 2023

bigquery, bigquery-denormalized, redshift, gcs, and s3 are all passing tests on CI.

Snowflake tests all failed due to connector crash on CI. I couldn't repro that locally; spot-checked a few tests and they all passed. We're not publishing snowflake from this branch anyway.

postgres/strict-encrypt are failing... but I think that's fine. We're not publishing them, and can deal with it later.

Will /approve-and-merge this pr first thing tomorrow.

@edgao (author) commented on Jul 14, 2023

/legacy-test connector=connectors/destination-snowflake

🕑 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5549055473
✅ connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5549055473
Python tests coverage:

Name                                                              Stmts   Miss  Cover
-------------------------------------------------------------------------------------
normalization/transform_config/__init__.py                            2      0   100%
normalization/transform_catalog/reserved_keywords.py                 15      0   100%
normalization/transform_catalog/__init__.py                           2      0   100%
normalization/destination_type.py                                    18      0   100%
normalization/__init__.py                                             4      0   100%
normalization/transform_catalog/destination_name_transformer.py     171     10    94%
normalization/transform_catalog/table_name_registry.py              174     34    80%
normalization/transform_config/transform.py                         195     48    75%
normalization/transform_catalog/utils.py                             51     14    73%
normalization/transform_catalog/dbt_macro.py                         22      7    68%
normalization/transform_catalog/catalog_processor.py                147     80    46%
normalization/transform_catalog/transform.py                         65     39    40%
normalization/transform_catalog/stream_processor.py                 603    407    33%
-------------------------------------------------------------------------------------
TOTAL                                                              1469    639    57%

Build Passed

Test summary info:

All Passed

@octavia-squidington-iii (Collaborator) commented:

destination-starburst-galaxy test report (commit ae9a5b50a2) - ❌

⏲️ Total pipeline duration: 32mn18s

Step Result
Validate airbyte-integrations/connectors/destination-starburst-galaxy/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-starburst-galaxy docker image for platform linux/x86_64
./gradlew :airbyte-integrations:connectors:destination-starburst-galaxy:integrationTest

🔗 View the logs here

Run the same pipeline locally with:

airbyte-ci connectors --name=destination-starburst-galaxy test

@octavia-squidington-iii (Collaborator) commented:

destination-oracle-strict-encrypt test report (commit ae9a5b50a2) - ❌

⏲️ Total pipeline duration: 26mn34s

Step Result
Validate airbyte-integrations/connectors/destination-oracle-strict-encrypt/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-oracle-strict-encrypt docker image for platform linux/x86_64
Build airbyte/normalization-oracle:dev
./gradlew :airbyte-integrations:connectors:destination-oracle-strict-encrypt:integrationTest

🔗 View the logs here

Run the same pipeline locally with:

airbyte-ci connectors --name=destination-oracle-strict-encrypt test

@octavia-squidington-iii (Collaborator) commented:

destination-mssql-strict-encrypt test report (commit ae9a5b50a2) - ❌

⏲️ Total pipeline duration: 02mn28s

Step Result
Validate airbyte-integrations/connectors/destination-mssql-strict-encrypt/metadata.yaml
Connector version semver check
QA checks
Build connector tar
Build destination-mssql-strict-encrypt docker image for platform linux/x86_64
Build airbyte/normalization-mssql:dev
./gradlew :airbyte-integrations:connectors:destination-mssql-strict-encrypt:integrationTest

🔗 View the logs here

Run the same pipeline locally with:

airbyte-ci connectors --name=destination-mssql-strict-encrypt test

@edgao (author) commented on Jul 14, 2023

/approve-and-merge reason="tested the relevant subset of connectors (#27936 (comment)) + subsequent legacy-test"

@octavia-approvington (Contributor) commented:

Let's do it! giddy-up

@octavia-approvington octavia-approvington merged commit 934acaa into master Jul 14, 2023
@octavia-approvington octavia-approvington deleted the edgao/1s1t_redeploy branch July 14, 2023 14:34
efimmatytsin pushed a commit to scentbird/airbyte that referenced this pull request Jul 27, 2023
* Revert "Revert "Destination Bigquery: Scaffolding for destinations v2 (airbytehq#27268)""

This reverts commit 348c577.

* version bumps+changelog

* Speed up BQ by having 2 queries, and not an OR (airbytehq#27981)

* 🐛 Destination Bigquery: fix bug in standard inserts for syncs >10K records (airbytehq#27856)

* only run t+d code if it's enabled

* dockerfile+changelog

* remove changelog entry

* Destinations V2: handle optional fields for `object` and `array` types (airbytehq#27898)

* catch null schema

* fix null properties

* clean up

* consolidate + add more tests

* try catch

* empty json test

* Automated Commit - Formatting Changes

* remove todo

* destination bigquery: misc updates to 1s1t code (airbytehq#28057)

* switch to checkedconsumer

* add unit test for buildColumnId

* use flag

* restructure prefix check

* fix build

* more type-parsing fixes (airbytehq#28100)

* more type-parsing fixes

* handle duplicates

* Automated Commit - Format and Process Resources Changes

* add tests for asColumns

* Automated Commit - Format and Process Resources Changes

* log warnings instead of throwing exception

* better log message

* error level

---------

Co-authored-by: edgao <edgao@users.noreply.github.com>

* Automated Commit - Formatting Changes

* Improve protocol type parsing (airbytehq#28126)

* Automated Commit - Formatting Changes

* Change from T&D every 10k records to an increasing time based interval (airbytehq#28130)

* fifteen minute t&d

* add typing and deduping operation valve for increased intervals of typing and deduping

* Automated Commit - Format and Process Resources Changes

* resolve bizarre merge conflict

* Automated Commit - Format and Process Resources Changes

---------

Co-authored-by: jbfbell <jbfbell@users.noreply.github.com>

* Simplify and speed up CDC delete support [DestinationsV2] (airbytehq#28029)

* Simplify and speed up CDC delete support [DestinationsV2]

* better QUOTE

* spotbugs?

* recompile dbt image for local arch and use that when building images

* things compile, but tests fail

* tests working-ish

* comment

* fix logic to re-insert deleted records for cursor comparison.

tests pass!

* remove comment

* Skip CDC re-include logic if there are no CDC columns

* stop hardcoding pk (airbytehq#28092)

* wip

* remove TODOs

---------

Co-authored-by: Edward Gao <edward.gao@airbyte.io>

* update method name

* Automated Commit - Formatting Changes

* depend on pinned normalization version

* implement 1s1t DATs for destination-bigquery (airbytehq#27852)

* initial implementation

* Automated Commit - Formatting Changes

* add second sync to test

* do concurrent things

* Automated Commit - Formatting Changes

* clarify comment

* minor tweaks

* more stuff

* Automated Commit - Formatting Changes

* minor cleanup

* lots of fixes

* handle sql vs json null better
* verify extra columns
* only check deleted_at if in DEDUP mode and the column exists
* add full refresh append test case

* Automated Commit - Formatting Changes

* add tests for the remaining sync modes

* Automated Commit - Formatting Changes

* readability stuff

* Automated Commit - Formatting Changes

* add test for gcs mode

* remove static fields

* Automated Commit - Formatting Changes

* add more test cases, tweak test scaffold

* cleanup

* Automated Commit - Formatting Changes

* extract recorddiffer

* and use it in the sql generator test

* fix

* comment

* naming+comment

* one more comment

* better assert

* remove unnecessary thing

* one last thing

* Automated Commit - Formatting Changes

* enable concurrent execution on all java integration tests

* add test for default namespace

* Automated Commit - Formatting Changes

* implement a 2-stream test

* Automated Commit - Formatting Changes

* extract methods

* invert jsonNodesNotEquivalent

* Automated Commit - Formatting Changes

* fix conditional

* pull out diffSingleRecord

* Automated Commit - Formatting Changes

* handle nulls correctly

* remove raw-specific handling; break up methods

* Automated Commit - Formatting Changes

---------

Co-authored-by: edgao <edgao@users.noreply.github.com>
Co-authored-by: octavia-approvington <octavia-approvington@users.noreply.github.com>

* Destinations V2: move create raw tables earlier (airbytehq#28255)

* move create raw tables

* better log message

* stop building normalization (airbytehq#28256)

* fix ability to run tests

* disable incremental t+d for now

* Automated Commit - Formatting Changes

---------

Co-authored-by: Evan Tahler <evan@airbyte.io>
Co-authored-by: Cynthia Yin <cynthia@airbyte.io>
Co-authored-by: cynthiaxyin <cynthiaxyin@users.noreply.github.com>
Co-authored-by: edgao <edgao@users.noreply.github.com>
Co-authored-by: Joe Bell <joseph.bell@airbyte.io>
Co-authored-by: jbfbell <jbfbell@users.noreply.github.com>
Co-authored-by: octavia-approvington <octavia-approvington@users.noreply.github.com>
6 participants